Putting small and big pieces together: a genome assembly approach reveals the largest Lamiid plastome in a woody vine

The plastid genome of flowering plants generally shows conserved structural organization, gene arrangement, and gene content. While structural reorganizations are uncommon, examples have been documented in the literature in recent years. Here we assembled the entire plastome of Bignonia magnifica and compared its structure and gene content with nine other Lamiid plastomes. The plastome of B. magnifica is composed of 183,052 bp and follows the canonical quadripartite structure, synteny, and gene composition of other angiosperms. Exceptionally large inverted repeat (IR) regions are responsible for the uncommon length of the genome. At least four events of IR expansion were observed among the seven Bignoniaceae species compared, suggesting multiple expansions of the IRs over the single-copy (SC) regions in the family. A comparison with 6,231 other complete plastomes of flowering plants available on GenBank revealed that the plastome of B. magnifica is the longest Lamiid plastome described to date. The newly generated plastid genome was used as a source of selected genes. These genes were combined with orthologous regions sampled from other species of Bignoniaceae, and all gene alignments were concatenated to infer a phylogeny of the family. The tree recovered is consistent with known relationships within the Bignoniaceae.

Plastome expansions have been reported in unrelated lineages of angiosperms, such as Annona L. (Annonaceae, Magnoliales, with up to 201,723 bp; Blazier et al., 2016), and Cypripedium L. (Orchidaceae, Asparagales, with up to 212,668 bp; Guo et al., 2021). The family Geraniaceae shows a remarkable example of plastome and IR size variation, with plastid genomes ranging from 128.7 kbp in Monsonia speciosa L. to 242.5 kbp in Pelargonium transvaalense R Knuth (Chumley et al., 2006; Guisinger et al., 2010; Weng, Ruhlman & Jansen, 2017). Pelargonium transvaalense has the largest plastome known to date, with IRs larger than 87.4 kbp. This plant family is also known for extreme reconfigurations of plastid genomes, showing multiple arrangements and repeats (Guisinger et al., 2010; Weng, Ruhlman & Jansen, 2017). Despite the giant IRs, structural variation, and differences in plastome size, Pelargonium L'Hér and other members of Geraniaceae carry a regular number of protein-coding genes and the usual 29 tRNAs (Guisinger et al., 2011). The data available for the Geraniaceae and other flowering plant lineages suggest that these increases in plastome size and reconfigurations are not necessarily associated with relevant changes in gene expression or the overall function of the organelle.
Expansions of the IRs and multiple genomic arrangements have also been described for the tropical Tribe Bignonieae (Firetti et al., 2017; Fonseca & Lohmann, 2017). Bignonieae plastomes range from 155 to 158 kbp in size, although some taxa from the informally named ''Multiples of Four Clade'' (i.e., a clade that shares multiples of four phloem wedges; see Lohmann, 2006) show significant increases in plastome size. Namely, the plastome of Amphilophium steyermarkii (AH Gentry) LG Lohmann is 164,786 bp long, while the plastome of Anemopaegma acutifolium DC. is 168,987 bp long (Firetti et al., 2017). Multiple independent advances of the IRa over the LSC possibly occurred within Amphilophium Kunth emend LG Lohmann (Thode, Sanmartín & Lohmann, 2019), while in Anemopaegma Mart. ex Meisn. the IRs are identical in terms of gene composition, with differences of hundreds of bases among species (Firetti et al., 2017). Bignonia is part of the ''Multiples of Four Clade''; however, plastomes are not available for the genus. A chloroplast genome from this genus is critical for a better understanding of the patterns and possible processes behind plastome evolution in tribe Bignonieae and the ''Multiples of Four Clade.'' Here we selected Bignonia magnifica to sequence the first plastome of the genus (Fig. 1). This species is a tropical liana, native to Ecuador and Colombia, but widely cultivated around the world. By combining short-read Illumina data and long-read PacBio data, we assembled the whole plastome of Bignonia magnifica (Fig. 2). This hybrid approach has been shown to improve assembly accuracy when determining plastome structure and sequence in flowering plants (Wu et al., 2014; Wang et al., 2018; Syme et al., 2020; Guo et al., 2021).
Although the long reads obtained from the PacBio platform have a higher error rate than short reads sequenced on Illumina platforms, the use of both technologies can help to reveal striking features regarding structural complexity in plastomes (Guo et al., 2021). We further compared the plastome of B. magnifica with genomic data available for selected species of Bignoniaceae and outgroups and evaluated the plastome size, structure, gene composition, presence of repetitive regions, and phylogenetic relationships.

Sampling and genome sequencing
[...] TissueLyser® (Qiagen, Duesseldorf, Germany) for 3 min at 60 Hz. Total genomic DNA was extracted using the Invisorb® Spin Plant Mini Kit (Invitek, Berlin, Germany). For Illumina sequencing, we followed the methodology described in Fonseca & Lohmann (2017). In short, the genomic DNA (∼5 µg) was fragmented using a Covaris S-series sonicator, generating DNA fragments of around 300 bp. A genomic library was built using the NEBNext DNA Library Prep Master Mix Set and the NEBNext Multiplex Oligos for Illumina (New England BioLabs Inc., Ipswich, MA). The final library of B. magnifica was diluted to 10 nM, pooled with 19 other non-target species in one lane, and sequenced (paired-end, 2 × 100 bp) on an Illumina HiSeq2000 system (Illumina, Inc., San Diego, CA, USA) at the University of São Paulo (Escola Superior de Agricultura Luiz de Queiroz) in Piracicaba, Brazil.
For PacBio® sequencing, library preparation and sequencing followed the manufacturer's instructions (Pacific Biosciences). In short, 10 µg of genomic DNA was isolated and fragmented to 7-20 kbp using HydroShear. A genomic library was constructed using the steps and parameters described in the manual (Pacific Biosciences). After DNA fragment size selection, one PacBio library was constructed using SMRTbell Template Prep Kit 1.0, and one SMRT cell was sequenced on the PacBio Sequel platform using the Sequel™ Sequencing Kit 1.2.1 at the Duke Center for Genomic and Computational Biology (Duke University School of Medicine, Durham, NC, USA).

Plastome assembly and annotation
To assemble the plastome data obtained with Illumina, we used the pipeline GetOrganelle 1.7.4.1 (Jin et al., 2020) with default parameters. Adaptors were removed and low-quality sequences trimmed using Trimmomatic 0.35 (Bolger, Lohse & Usadel, 2014) with the SLIDINGWINDOW:10:20 and MINLEN:40 parameters. Trimmed reads were used as input for the script ''get_organelle_from_reads.py'', which is the main workflow of GetOrganelle (Jin et al., 2020). This script uses Bowtie2 (Langmead & Salzberg, 2012), BLAST (Camacho et al., 2009), and SPAdes 3.1.0 (Bankevich et al., 2012), as well as the Python libraries NumPy and SymPy. The pipeline starts by mapping the reads with Bowtie2 against a database of plastomes used as seeds. The initial target-associated reads were treated as ''baits'' to increase the number of plastome reads through multiple extension iterations. SPAdes was used to build a de novo FASTA assembly graph (FASTG), and BLAST was used to remove any non-target sequences retained. The slimmed FASTG file was used to calculate all paths of the complete target organelle using the structure of the graph and coverage information.
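The quality-trimming step described above can be illustrated with a short sketch. The Python below is a simplified illustration of a sliding-window trim followed by a minimum-length filter, using the window size (10), mean quality threshold (20), and minimum length (40) given in the text. It is not Trimmomatic's implementation and may differ from it in edge cases; the function names and the Phred+33 quality encoding are assumptions made for the example.

```python
from typing import List, Optional, Tuple


def phred33(qual_string: str) -> List[int]:
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(c) - 33 for c in qual_string]


def sliding_window_cut(quals: List[int], window: int = 10,
                       min_avg: float = 20.0) -> int:
    """Return how many leading bases to keep.

    Windows are scanned from the 5' end; the read is cut at the start of
    the first window whose mean quality falls below min_avg.
    """
    n = len(quals)
    if n == 0:
        return 0
    w = min(window, n)          # reads shorter than the window use one window
    total = sum(quals[:w])
    start = 0
    while True:
        if total / w < min_avg:
            return start        # cut where the failing window begins
        if start + w >= n:
            return n            # reached the 3' end without failing
        total += quals[start + w] - quals[start]
        start += 1


def trim_read(seq: str, qual_string: str, window: int = 10,
              min_avg: float = 20.0,
              min_len: int = 40) -> Optional[Tuple[str, str]]:
    """Trim one read; return None if the result is shorter than min_len."""
    keep = sliding_window_cut(phred33(qual_string), window, min_avg)
    if keep < min_len:
        return None
    return seq[:keep], qual_string[:keep]
```

Under these assumptions, a 60-base read whose last ten bases have very low quality is shortened rather than discarded, while a read that fails in its first window, or that ends up shorter than 40 bases, is dropped.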
To assemble the plastome using the data obtained through PacBio, we assembled the reads de novo into a number of contigs using the SMRT Analysis package 2.3 (Pacific Biosciences) with the HGAP3 protocol, followed by polishing with the Quiver algorithm. A k-mer analysis with a k-mer size of 20 was performed on each of the assemblies individually using Jellyfish 2.1.3 (Marçais & Kingsford, 2011). Plastome sequences were fished out using BLAT (Kent, 2002) with the plastome of Anemopaegma arvense as reference.
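As an illustration of what a k-mer analysis computes, the following Python sketch counts 20-mers in a sequence. It counts canonical k-mers (each k-mer and its reverse complement are treated as one), which is an assumption of this example; the text does not state which counting mode was used. The function names are hypothetical, and a dedicated counter such as Jellyfish is far more efficient on real data.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_canonical_kmers(seq: str, k: int = 20) -> Counter:
    """Count canonical k-mers (the lexicographically smaller strand)."""
    seq = seq.upper()
    counts: Counter = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) - set("ACGT"):
            continue  # skip k-mers containing ambiguous bases such as N
        counts[min(kmer, reverse_complement(kmer))] += 1
    return counts
```

A histogram of these counts separates high-copy sequence, such as plastid DNA in a total genomic extraction, from low-copy nuclear sequence.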
The assemblies obtained using sequences from Illumina and PacBio for B. magnifica were annotated using GeSeq (Tillich et al., 2017) with default parameters (https://chlorobox.mpimp-golm.mpg.de/index.html). The assemblies of B. magnifica were evaluated and manually compared using Geneious 9.0.2. By combining the information derived from both assemblies, we were able to establish the LSC, SSC, and IR limits with high confidence and provide a complete plastome for B. magnifica. Both assemblies were largely congruent, with only small base-pair differences and small indel differences (10-20 bp) observed between them. Base mismatches were resolved following the Illumina data, while indels were resolved following the PacBio data. The final plastome assembly was verified using a coverage analysis implemented in Jellyfish 2.1.3 (Marçais & Kingsford, 2011). The estimate of 25-mer abundance was used to map a 25-bp sliding window of coverage across the plastome. Junctions of the quadripartite structure were tested interactively using the program afin (https://github.com/mrmckain/Fast-Plast/tree/master/afin). The final plastome map was produced using OGDRAW (Lohse, Drechsel & Bock, 2007). The NCBI accession number of the complete plastome is available in

Table 1. General features of the Bignonia magnifica plastome and nine other Lamiid plastomes, showing the number of base pairs (bp) in different genome regions (i.e., LSC, large single copy; SSC, small single copy; IR, inverted repeats). The percentage of guanine-cytosine (GC) and the number of genes across the IR, including protein-coding genes (CDS), ribosomal RNA (rRNA), and transfer RNA (tRNA) genes, are also presented.

Phylogenetic analyses
To infer the phylogenetic placement of B. magnifica within the Bignoniaceae, we used the same 80 plastid genes (Table 1) used to infer the phylogeny of angiosperms by Li et al. (2019). DNA sequences were aligned using MAFFT 7.309 (Katoh & Standley, 2013). The alignments were edited using GBlocks (Talavera & Castresana, 2007), and regions found in fewer than 50% of the species were deleted. PartitionFinder2 (Stamatakis, 2006; Lanfear et al., 2014; Lanfear et al., 2016) was used to estimate partition schemes and molecular evolutionary models for each of the 80 plastid genes. Phylogenetic reconstructions were conducted using 12 species. Regions obtained from the newly assembled plastome were combined with sequences from the nine other species used here for structural comparisons, plus two other species with available plastid genomes that were used as outgroups, Sesamum indicum L. (Pedaliaceae) and Scrophularia dentata Royle ex Benth. (Scrophulariaceae). Phylogenetic inferences were conducted using Maximum Likelihood (ML) in RAxML 8.2.9 (Stamatakis, 2014) and Bayesian Inference (BI) in MrBayes 3.2 (Ronquist et al., 2012). Branch support was estimated using 1,000 bootstrap replicates (bs) for ML and posterior probabilities (pp) for BI.
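The concatenation of per-gene alignments into a single matrix can be sketched as follows. This Python example is a minimal illustration, not the procedure used in the study: it interprets the 50% rule as dropping any gene present in fewer than half of the taxa (the filtering described above may instead operate on alignment regions), pads missing taxa with gaps, and records the coordinates of each gene so that a partition scheme can be defined. The data layout (a dictionary of gene name to taxon-to-sequence mapping) and the function name are assumptions.

```python
from typing import Dict, List, Tuple

Alignment = Dict[str, str]  # taxon name -> aligned sequence


def concatenate(alignments: Dict[str, Alignment], taxa: List[str],
                min_occupancy: float = 0.5
                ) -> Tuple[Dict[str, str], List[Tuple[str, int, int]]]:
    """Build a supermatrix and 1-based, inclusive partition coordinates."""
    pieces: Dict[str, List[str]] = {t: [] for t in taxa}
    partitions: List[Tuple[str, int, int]] = []
    position = 0
    for gene in sorted(alignments):
        aln = alignments[gene]
        present = [t for t in taxa if t in aln]
        if len(present) / len(taxa) < min_occupancy:
            continue  # gene sampled in too few taxa
        lengths = {len(aln[t]) for t in present}
        if len(lengths) != 1:
            raise ValueError(f"{gene}: sequences are not aligned")
        length = lengths.pop()
        for t in taxa:
            # taxa lacking this gene are padded with gap characters
            pieces[t].append(aln.get(t, "-" * length))
        partitions.append((gene, position + 1, position + length))
        position += length
    matrix = {t: "".join(parts) for t, parts in pieces.items()}
    return matrix, partitions
```

The returned coordinates map directly onto the per-gene blocks that a partitioning tool or a partitioned ML analysis expects as input.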

Plastome assembly
We sequenced the complete plastome of B. magnifica using Illumina and PacBio technologies. For the sequences generated with Illumina, we obtained 3,563,896 paired-end reads after the adaptors were removed and low-quality sequences trimmed. For R1 and R2, we obtained 352,991 and 59,697 non-paired reads, respectively. The mean read length was 81.8 bp. After running the GetOrganelle pipeline, the longest contig obtained was 182,643 bp, corresponding to almost the entire plastome sequence. The PacBio sequencing generated 150,292 (1.02 Gb) single-molecule long subreads in total, with an average length of 6,777 bp and an N50 of 14,789 bp. Overall, 172 contigs were identified, the longest of which was 151,204 bp in length and represented the partial plastome with one copy of the IR. Results obtained using both sequencing technologies were visually compared to reach the final plastome size and evaluate the boundaries between LSC, SSC, and IRs. Junctions of the quadripartite structure were tested interactively and recovered in all analyses. The mean plastome coverage obtained using Illumina data was 825.8×, with more than 99% of the bases having coverage equal to or greater than 100×.
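Summary statistics of the kind reported above can be computed from raw read lengths and per-base depths with a few lines of code. The following Python sketch computes an N50, a sliding-window mean depth (25 bp by default, matching the window size mentioned in the methods), and the fraction of positions at or above a depth threshold. The inputs (a list of read lengths and a list of per-base depths) and the function names are hypothetical; this is not the code used in the study.

```python
from typing import List, Sequence


def n50(lengths: Sequence[int]) -> int:
    """Length L such that sequences of length >= L hold half the total bases."""
    if not lengths:
        raise ValueError("no lengths given")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return ordered[-1]  # not reached for positive lengths


def window_means(depth: Sequence[float], window: int = 25) -> List[float]:
    """Mean depth in every full window, sliding one base at a time."""
    if len(depth) < window:
        return []
    total = sum(depth[:window])
    means = [total / window]
    for i in range(window, len(depth)):
        total += depth[i] - depth[i - window]
        means.append(total / window)
    return means


def fraction_at_least(depth: Sequence[float], threshold: float) -> float:
    """Fraction of positions with depth >= threshold."""
    return sum(d >= threshold for d in depth) / len(depth)
```

As a consistency check on the reported figures, 150,292 subreads with a mean length of 6,777 bp amount to about 1.02 Gb, in agreement with the total given in the text.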

Plastome features
The final plastome of B. magnifica is 183,052 bp long. The genome has the typical quadripartite structure of angiosperms, which consists of a pair of IR regions (54,727 bp each), an LSC region (60,832 bp), and an SSC region (12,766 bp) (Table 1). A circular plastome map of B. magnifica is shown in Fig. 2. The average GC content is 37.4%. The plastome includes 157 genes, representing 110 coding regions, 39 tRNAs, and eight rRNAs (Table 1, Table 2). Thirteen different genes have at least one intron, while three genes have two introns (i.e., clpP, rps12, and ycf3) (Table 2). For rps12, a trans-splicing event was observed, with the 5′ end located in the LSC and a duplicated rps12 3′ end in the IRs (Table 2). Among protein-coding genes, 108 started with the standard initiator AUG. The rps19 and ndhD genes are exceptions, with rps19 starting with GUG and ndhD starting with ACG. The stop codon UAA was the most common, followed by UAG and UGA. We identified 688 repetitive motifs for B. magnifica using Phobos. These tandem regions ranged from single-nucleotide repetitions (mononucleotide) to 52-nucleotide repetitive regions. These regions represented 6.6% of the total genome size. Among all species analyzed, L. origanoides showed the fewest repetitive regions (567 in total), while A. arvense showed the highest number of repetitive regions (694 in total). The number of each class of repetitive region (i.e., mononucleotide, dinucleotide, or trinucleotide) differed among species but showed similar numbers within the Bignoniaceae. Mononucleotide repetitive regions were the most common among Bignoniaceae species, while dinucleotide repetitions were the least common. The same pattern was observed in L. origanoides, although a lower number of repetitive regions was observed for each class (Table 3).
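The region sizes reported above can be checked against the total plastome length, since a quadripartite plastome consists of one LSC, one SSC, and two IR copies. The short Python sketch below performs that arithmetic and also shows a simple GC-content function; the function names are illustrative, and only the numbers are taken from the text.

```python
def plastome_length(lsc: int, ssc: int, ir: int) -> int:
    """Total length of a quadripartite plastome: LSC + SSC + two IR copies."""
    return lsc + ssc + 2 * ir


def ir_fraction(lsc: int, ssc: int, ir: int) -> float:
    """Share of the genome occupied by the two IR copies."""
    return 2 * ir / plastome_length(lsc, ssc, ir)


def gc_content(seq: str) -> float:
    """Fraction of G and C among the unambiguous bases of a sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


# Region sizes reported for B. magnifica (bp)
LSC, SSC, IR = 60_832, 12_766, 54_727

print(plastome_length(LSC, SSC, IR))  # 183052
```

With these values the two IR copies together account for close to 60% of the genome, which illustrates how strongly the IR expansion drives the total length. The gene counts are likewise internally consistent: 110 coding regions, 39 tRNAs, and eight rRNAs sum to 157 genes.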

Comparative plastome structure and size
Differences in the IR/SSC boundary region were observed when the plastome of B. magnifica was compared to the plastomes of L. origanoides and T. capensis. These differences suggest a reduction of the SSC due to the incorporation of the entire ycf1 gene and part of the rps15 gene into the IRs in the clade composed of C. [...] (Figs. 3 and 4). The incorporation of genes within the IRs led to a massive increase in the size of the plastomes of B. magnifica and other members of the ''Multiples of Four Clade.'' As described above, at least four gene translocations due to IR expansion occurred (Figs. 3 and 4). The first movement was observed at the SSC-IRa boundary. The second and third movements changed the structure of the LSC-IRb boundary. Group 2 also incorporated part of the petD gene, while Group 3 incorporated the entire petD gene and part of the petB gene into the IR. The fourth movement combined expansions at both the LSC-IRa and LSC-IRb boundaries (Figs. 3 and 4). The SSC-IRb boundary is constant within Bignoniaceae. Structural differences were observed between members of the Bignoniaceae and the plastome of L. origanoides (Fig. 4) [...] that this transition is exclusive to L. origanoides or exclusive to an internal clade of the family.

Table residue (gene lists per IR expansion group; only the Group 1 label survives):
[...], rpl22, rps3, rpl16, rpl14, rps8, infA, rpl36, rps11, rpoA, trnH, psbA, trnK, matK, rps16, petD, petB, psbH, psbN, psbT, psbB, clpP, rps12, rpl20, rps18
rpl22, rps3, rpl16, rpl14, rps8, infA, rpl36, rps11, rpoA, and partial petD
Group 1: ycf1, and partial rps15
rps19, rpl22, rps3, rpl16, rpl14, rps8, infA, rpl36, rps11, rpoA, petD

No differences in gene order were observed in B. magnifica when compared to the other Bignonieae plastomes analyzed. While an apparent change in synteny was observed for B. magnifica, this gene block movement seems to have been caused by the incorporation of genes close to the LSC-IRa boundary into the IR (Fig. S1). Rearrangements were observed for the Bignonieae clade composed of A. arvense and P. venusta. The inversion of the region containing the genes ycf2, trnI, and trnL is consistent with earlier findings for Anemopaegma as a whole (Firetti et al., 2017).

Phylogenetic analyses
The phylogenetic hypotheses reconstructed with Maximum Likelihood and Bayesian Inference using a dataset composed of 80 plastome genes and 12 species showed identical topologies. The outgroups S. indicum and S. dentata were used to root the trees. The family Bignoniaceae, the Tribe Bignonieae, and the ''Multiples of Four Clade'' emerged as monophyletic, confirming earlier phylogenetic findings (Lohmann, 2006; Olmstead et al., 2009). The position of B. magnifica as part of the ''Multiples of Four Clade'' was also recovered (Lohmann, 2006). While all clades showed maximum or high support in the Bayesian analysis, two relationships were poorly supported in the ML analysis: (i) the sister-group relationship between Catalpa and a clade composed of Bignonieae + Tabebuia nodosa (61 bs); and (ii) the Bignonieae clade composed of B. magnifica, A. arvense, and P. venusta (65 bs; Fig. 2). These relationships were also poorly supported in earlier studies (Lohmann, 2006; Olmstead et al., 2009), suggesting recalcitrant points in the phylogeny.

DISCUSSION
In this study, we assembled the plastome of Bignonia magnifica and compared it with nine other Lamiid plastomes. The recovered B. magnifica plastome follows the canonical quadripartite structure, synteny, and gene composition found in other angiosperm plastomes. The total number of repetitive regions and the number observed for each class of repetitive regions are similar to those observed in other species of the family. Remarkable differences were observed in the size of the IRs, which are the longest in the family and are responsible for the largest plastome available to date among the Lamiids. At least four events of IR expansion were observed within the seven Bignoniaceae species compared, suggesting multiple expansions of the IRs over the SC regions. The newly generated plastid genome was used as a source of selected genes. These genes were combined with orthologous regions sampled from other species of Bignoniaceae, and all gene alignments were concatenated to infer a phylogeny of the family. The tree reconstructed here recovered a monophyletic Bignoniaceae and a monophyletic tribe Bignonieae, corroborating previous findings. The topology recovered here also confirmed the monophyly of the ''Multiples of Four Clade'' and previously recovered relationships within this lineage.
The total number of repetitive regions and the number of repetitive regions within each class were also similar among the species analyzed (Table 3). Repetitive sequences and IR expansions are correlated and involved in syntenic disruptions of plastomes (Chumley et al., 2006). Here the IR expansion of B. magnifica could not be associated with massive structural disruptions (Fig. S1), nor with an increase in the number of repetitive regions (Table 3). This result was also observed in other sampled members of the ''Multiples of Four Clade'' (Firetti et al., 2017), suggesting that the IR expansions within the clade are not leading to increases in the number of repetitive regions nor to the accumulation of rearrangements. The phylogeny recovered here is congruent with a previous phylogeny of the Bignoniaceae (Olmstead et al., 2009) and a previous phylogeny of Bignonieae that placed B. magnifica within the ''Multiples of Four Clade'' (Lohmann, 2006). The phylogeny reconstructed here aimed to provide an evolutionary framework within which to compare selected Bignoniaceae plastomes. Although the reduced sampling inflates the support of most nodes and simplifies the possible implications of the results, two nodes were poorly supported, including the node leading to the clade composed of A. arvense, P. venusta, and B. magnifica (Fig. 2). Recalcitrant branches were previously observed in Adenocalymma (Fonseca & Lohmann, 2018) and Amphilophium (Thode, Sanmartín & Lohmann, 2019), illustrating some limitations of plastome data for phylogeny reconstruction.

The making of large plastomes
The plastome of B. magnifica recovered in this study is the largest known to date among the Lamiids (Fig. 4). A dramatic increase in IR size led to a plastid genome of 183,052 bp, which is 14,065 bp longer than the plastome of Anemopaegma acutifolium DC., the second largest in the Bignoniaceae. Plastome size increase due to IR expansion over the LSC regions has been described for Anemopaegma (Firetti et al., 2017) and Amphilophium. These two genera, as well as Bignonia, Mansoa, and Pyrostegia, are part of the ''Multiples of Four Clade'' (Lohmann, 2006). Within the clade, at least three events of plastome increase were observed (Fig. 3). The expansions of the IR over the LSC region observed for B. magnifica involved the capture of LSC regions from both the LSC/IRa and LSC/IRb boundaries and likely resulted from independent IR expansions (Fig. 3).
The results obtained here bring new insights into plastome evolution. However, elucidating the exact number, mechanism, and timing of those expansions throughout the clade requires an improved sampling of plastomes within Tribe Bignonieae and the ''Multiples of Four Clade''. In total, ten Anemopaegma plastomes are available, all of which are homogeneous in terms of structure and size; however, a higher number of Anemopaegma plastomes is needed before generalizations can be made (Firetti et al., 2017). Differences in plastome size and IR expansions were observed among the 11 complete Amphilophium plastomes sequenced to date; however, the species sharing structural patterns are not necessarily closely related, and no clear phylogenetic pattern was observed (Thode, Sanmartín & Lohmann, 2019).
While the plastome of B. magnifica is giant within Bignonieae and other Lamiids, 40 other plastomes from diverse angiosperm clades are larger than the plastome of B. magnifica (Fig. 5; Guisinger et al., 2010; Weng, Ruhlman & Jansen, 2017). Almost all of these plastomes share the expansion of the IRs over SC regions as the main mechanism responsible for their large sizes (Blazier et al., 2016; Zhang et al., 2016; Weng, Ruhlman & Jansen, 2017; Ruhlman & Jansen, 2018; Lee, Ruhlman & Jansen, 2020). These findings highlight the importance of these expansions for plastid genome size and gene composition. As these expansions of the IRs are found throughout the ''Multiples of Four Clade'', more plastomes with expansions are expected, some of which might be larger than the plastome of B. magnifica documented here.
Expansions of IRs are linked to some plastome properties, such as the number of repetitive regions and the frequency of rearrangements (Guisinger et al., 2011; Weng, Ruhlman & Jansen, 2017; Lee, Ruhlman & Jansen, 2020). No significant differences were observed when we compared the plastome of B. magnifica with nine other genomes in terms of the number of repetitive regions and synteny (Table 3, Fig. S1). Improved sampling within the ''Multiples of Four Clade'' would allow statistical testing and the implementation of comparative methods to evaluate putative correlations between plastome size, DNA sequence, and structural properties (Weng, Ruhlman & Jansen, 2017).
The reduction of substitution rates on genes in the IR (when compared to SC genes) is also worth noting (Zhu et al., 2016;Weng, Ruhlman & Jansen, 2017). The two identical IR copies provide a template for error correction when a mutation occurs in one of the copies, likely suppressing substitution rates in the IR. When the IRs incorporate genes, substitution rates are expected to decrease in those regions. While this expectation was tested in Pelargonium, no significant correlations were found (Weng, Ruhlman & Jansen, 2017). These findings illustrate that the effect of IR expansion/contraction on substitution rates may not be relevant or easily detectable. New molecular data on Pelargonium and other plant groups are necessary to properly test this prediction. The diversity of plastomes found within the ''Multiples of Four Clade'' makes this group an excellent model within which to test hypotheses about plastome evolution.

CONCLUSION
The complete plastome of B. magnifica shows the striking dimensions that these genomes can reach within the family, especially within Tribe Bignonieae. Some patterns were recovered when plastomes were compared in lineages with IR expansions; however, rigorous tests are still necessary to formally evaluate the patterns encountered and the putative underlying causes. Indeed, new data are still needed to answer many open questions, such as: (i) Are these expansions or contractions related to plastome rearrangements? (ii) Are the expansions or contractions related to an increase or decrease in the number of repetitive regions? (iii) Is it possible to observe differences in substitution rates for genes found in different compartments of the genome? The dozens of complete plastomes available for the Tribe Bignonieae to date (Firetti et al., 2017; Fonseca & Lohmann, 2017) contribute important data and bring new insights into the molecular patterns. The extensive phylogenetic data available (Firetti et al., 2017; Fonseca & Lohmann, 2018; Thode, Sanmartín & Lohmann, 2019) or to be published soon, combined with more complete plastomes for members of Bignonieae, provide a strong basis for future studies on plastome evolution in the clade. In this sense, the plastome of Bignonia magnifica is a significant step forward, showing new molecular patterns within tribe Bignonieae.